Getting Started

作业介绍
- recitation-测试
- homework-改进代码
提前设置
```
make <target> DEBUG=1
```

Recitation: Perf and Cachegrind

Perf

安装 Perf

出现了问题，已解决，但是没有记录

Perf 个人使用速查

perf record
用法：
$ perf record <program_name> <program_arguments>

running ./isort n k will sort an array of n elements k times.

$ sudo perf record ./isort 10000 10
Sorting 10000 values...
Done!
[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 0.035 MB perf.data (887 samples) ]

perf report
用法：直接perf report

$ sudo perf report
# To display the perf.data header info, please use --header/--header-only options.
#
#
# Total Lost Samples: 0
#
# Samples: 887  of event 'cpu-clock:pppH'
# Event count (approx.): 221750000
#
# Overhead  Command  Shared Object      Symbol
# ........  .......  .................  .................
#
    99.44%  isort    isort              [.] isort
     0.23%  isort    libc-2.27.so       [.] rand_r
     0.11%  isort    [kernel.kallsyms]  [k] queue_work_on
     0.11%  isort    [kernel.kallsyms]  [k] release_pages
     0.11%  isort    isort              [.] main

这里就可以看到 isort 占据了很大一部分时间

Cachegrind

Cachegrind (a Valgrind tool) is a cache and branch-prediction profiler On virtual environments like those on AWS, hardware events providing information about branches and cache misses are often unavailable, so perf may not be helpful.

valgrind
使用：
$ valgrind --tool=cachegrind --branch-sim=yes <program_name> <program_arguments>
输出解释：
D1 represents the lowest-level cache (L1)
LL represents the last (highest) level data cache (on most machines, L3)

例子代码

// Copyright (c) 2012 MIT License by 6.172 Staff

#include <stdio.h>
#include <stdint.h>
#include <stdlib.h>

typedef uint32_t data_t;
const int U = 10000000;   // size of the array. 10 million vals ~= 40MB
const int N = 100000000;  // number of searches to perform

int main() {
  data_t* data = (data_t*) malloc(U * sizeof(data_t));
  if (data == NULL) {
    free(data);
    printf("Error: not enough memory\n");
    exit(-1);
  }

  // fill up the array with sequential (sorted) values.
  int i;
  for (i = 0; i < U; i++) {
    data[i] = i;
  }

  printf("Allocated array of size %d\n", U);
  printf("Summing %d random values...\n", N);

  data_t val = 0;
  data_t seed = 42;
  for (i = 0; i < N; i++) {
    int l = rand_r(&seed) % U;
    val = (val + data[l]);
  }

  free(data);
  printf("Done. Value = %d\n", val);
  return 0;
}

编译

$ make sum

剖析

$ valgrind --tool=cachegrind --branch-sim=yes ./sum

输出

==125== Cachegrind, a cache and branch-prediction profiler
==125== Copyright (C) 2002-2017, and GNU GPL'd, by Nicholas Nethercote et al.
==125== Using Valgrind-3.13.0 and LibVEX; rerun with -h for copyright info
==125== Command: ./sum
==125==
--125-- warning: L3 cache found, using its data for the LL simulation.
Allocated array of size 10000000
Summing 100000000 random values...
Done. Value = 938895920
==125==
==125== I   refs:      3,440,227,630
==125== I1  misses:            1,195
==125== LLi misses:            1,180
==125== I1  miss rate:          0.00%
==125== LLi miss rate:          0.00%
==125==
==125== D   refs:        610,074,405  (400,058,343 rd   + 210,016,062 wr)
==125== D1  misses:      100,507,416  ( 99,881,550 rd   +     625,866 wr)
==125== LLd misses:       37,859,997  ( 37,234,195 rd   +     625,802 wr)
==125== D1  miss rate:          16.5% (       25.0%     +         0.3%  )
==125== LLd miss rate:           6.2% (        9.3%     +         0.3%  )
==125==
==125== LL refs:         100,508,611  ( 99,882,745 rd   +     625,866 wr)
==125== LL misses:        37,861,177  ( 37,235,375 rd   +     625,802 wr)
==125== LL miss rate:            0.9% (        1.0%     +         0.3%  )
==125==
==125== Branches:        210,043,840  (110,043,336 cond + 100,000,504 ind)
==125== Mispredicts:           5,456  (      5,248 cond +         208 ind)
==125== Mispred rate:            0.0% (        0.0%     +         0.0%   )

查找 Cache 信息

lscpu
用法：
$ lscpu
作用：
find information about your CPU and its caches

$ lscpu
Architecture:        x86_64
CPU op-mode(s):      32-bit, 64-bit
Byte Order:          Little Endian
CPU(s):              20
On-line CPU(s) list: 0-19
Thread(s) per core:  2
Core(s) per socket:  10
Socket(s):           1
Vendor ID:           GenuineIntel
CPU family:          6
Model:               154
Model name:          12th Gen Intel(R) Core(TM) i7-12700H
Stepping:            3
CPU MHz:             2687.998
BogoMIPS:            5375.99
Virtualization:      VT-x
Hypervisor vendor:   Microsoft
Virtualization type: full
L1d cache:           48K
L1i cache:           32K
L2 cache:            1280K
L3 cache:            24576K
Flags:               fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc rep_good nopl xtopology tsc_reliable nonstop_tsc cpuid pni pclmulqdq vmx ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch invpcid_single ssbd ibrs ibpb stibp ibrs_enhanced tpr_shadow vnmi ept vpid ept_ad fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 xsaves umip waitpkg gfni vaes vpclmulqdq rdpid movdiri movdir64b fsrm serialize flush_l1d arch_capabilities

在我们这里就可以看到 L1、L2 和 L3 的 Cache 信息了

再观察输出结果，因为我们写 Cache miss 少（代码顺序写）和读 Cache miss 多（代码随机读）然后我们通过改变 U 和 N，控制 U 小于 L1、L2、L3 Cache 等情况，就能看出 N、U 大小和 D、L、LLD Cache miss 的关系

Homework: Sorting

Write-up 1:

Compare the Cachegrind output on the DEBUG=1 code versus DEBUG=0 compiler optimized code. Explain the advantages and disadvantages of using instruction count as a substitute for time when you compare the performance of different versions of this program

$ make
clang main.c tests.c util.c isort.c sort_a.c sort_c.c sort_i.c sort_p.c sort_m.c sort_f.c -O3 -DNDEBUG -g -Wall -std=gnu99 -gdwarf-3 -always-inline -lrt -lm  -o sort
clang: warning: argument unused during compilation: '-always-inline' [-Wunused-command-line-argument]

$ valgrind --tool=cachegrind --branch-sim=yes ./sort 10000 10
==184== Cachegrind, a cache and branch-prediction profiler
==184== Copyright (C) 2002-2017, and GNU GPL'd, by Nicholas Nethercote et al.
==184== Using Valgrind-3.13.0 and LibVEX; rerun with -h for copyright info
==184== Command: ./sort 10000 10
==184==
--184-- warning: L3 cache found, using its data for the LL simulation.

Running test #0...
Generating random array of 10000 elements
Arrays are sorted: yes
 --> test_correctness at line 217: PASS
sort_a          : Elapsed execution time: 0.014364 sec
sort_a repeated : Elapsed execution time: 0.013956 sec
Generating inverted array of 10000 elements
Arrays are sorted: yes
 --> test_correctness at line 217: PASS
sort_a          : Elapsed execution time: 0.027513 sec
sort_a repeated : Elapsed execution time: 0.027134 sec

Running test #1...
 --> test_zero_element at line 245: PASS

Running test #2...
 --> test_one_element at line 266: PASS
Done testing.
==184==
==184== I   refs:      235,925,874
==184== I1  misses:          1,586
==184== LLi misses:          1,467
==184== I1  miss rate:        0.00%
==184== LLi miss rate:        0.00%
==184==
==184== D   refs:       87,546,459  (52,667,725 rd   + 34,878,734 wr)
==184== D1  misses:        228,442  (   127,084 rd   +    101,358 wr)
==184== LLd misses:          5,123  (     2,420 rd   +      2,703 wr)
==184== D1  miss rate:         0.3% (       0.2%     +        0.3%  )
==184== LLd miss rate:         0.0% (       0.0%     +        0.0%  )
==184==
==184== LL refs:           230,028  (   128,670 rd   +    101,358 wr)
==184== LL misses:           6,590  (     3,887 rd   +      2,703 wr)
==184== LL miss rate:          0.0% (       0.0%     +        0.0%  )
==184==
==184== Branches:       40,474,926  (38,773,975 cond +  1,700,951 ind)
==184== Mispredicts:     2,484,878  ( 2,484,522 cond +        356 ind)
==184== Mispred rate:          6.1% (       6.4%     +        0.0%   )

$ make clean
rm -f ./sort *.std* *.gcov *.gcda *.gcno default.profraw

$ make DEBUG=1
clang main.c tests.c util.c isort.c sort_a.c sort_c.c sort_i.c sort_p.c sort_m.c sort_f.c -DDEBUG -O0 -g -Wall -std=gnu99 -gdwarf-3 -always-inline -lrt -lm  -o sort
clang: warning: argument unused during compilation: '-always-inline' [-Wunused-command-line-argument]

$  valgrind --tool=cachegrind --
ranch-sim=yes ./sort 10000 10valgrind: no program specified
valgrind: Use --help for more information.
cjl@ChenJulian:~/solution/mit6.172/Homework/HW2/homework$  valgrind --tool=cachegrind --branch-sim=yes ./sort 10000 10
==206== Cachegrind, a cache and branch-prediction profiler
==206== Copyright (C) 2002-2017, and GNU GPL'd, by Nicholas Nethercote et al.
==206== Using Valgrind-3.13.0 and LibVEX; rerun with -h for copyright info
==206== Command: ./sort 10000 10
==206==
--206-- warning: L3 cache found, using its data for the LL simulation.

Running test #0...
Generating random array of 10000 elements
Arrays are sorted: yes
 --> test_correctness at line 217: PASS
sort_a          : Elapsed execution time: 0.025243 sec
sort_a repeated : Elapsed execution time: 0.025381 sec
Generating inverted array of 10000 elements
Arrays are sorted: yes
 --> test_correctness at line 217: PASS
sort_a          : Elapsed execution time: 0.049681 sec
sort_a repeated : Elapsed execution time: 0.049848 sec

Running test #1...
 --> test_zero_element at line 245: PASS

Running test #2...
 --> test_one_element at line 266: PASS
Done testing.
==206==
==206== I   refs:      408,412,706
==206== I1  misses:          1,553
==206== LLi misses:          1,457
==206== I1  miss rate:        0.00%
==206== LLi miss rate:        0.00%
==206==
==206== D   refs:      260,697,695  (194,384,270 rd   + 66,313,425 wr)
==206== D1  misses:        228,025  (    126,767 rd   +    101,258 wr)
==206== LLd misses:          5,115  (      2,411 rd   +      2,704 wr)
==206== D1  miss rate:         0.1% (        0.1%     +        0.2%  )
==206== LLd miss rate:         0.0% (        0.0%     +        0.0%  )
==206==
==206== LL refs:           229,578  (    128,320 rd   +    101,258 wr)
==206== LL misses:           6,572  (      3,868 rd   +      2,704 wr)
==206== LL miss rate:          0.0% (        0.0%     +        0.0%  )
==206==
==206== Branches:       45,174,708  ( 43,473,813 cond +  1,700,895 ind)
==206== Mispredicts:     3,109,490  (  3,109,160 cond +        330 ind)
==206== Mispred rate:          6.9% (        7.2%     +        0.0%   )

得出结果

DEBUG	0	1
时间	快	慢
指令数	少	多
Cache Misses 率	大	小
MISpredicts 率	小	大

O0 牺牲了 cache 命中率，但是指令数少，时间快

inlining

You would like to see how much inline functions can help. Copy over the code from sort_a.c into sort_i.c, and change all the routine names from _a to _i. Using the inline keyword, inline one or more of the functions in sort_i.c and util.c. To add the sort_i routine to the testing suite, uncomment the line in main.c, under testFunc, that specifies sort_i. Profile and annotate the inlined program.

Write-up 2:

Explain which functions you chose to inline and report the performance differences you observed between the inlined and uninlined sorting routines.

如题，将 sort_i 中的 merge_i 和 copy_i 设置为 inline。并进行测试

编译和 perf record 的过程不再给出

no use inlining & DEBUG=1

cjl@ChenJulian:/mnt/c/Data Files/mit6.172/Homework/HW2/homework$ perf record ./sort 10000 10

Running test #0...
Generating random array of 10000 elements
Arrays are sorted: yes
 --> test_correctness at line 217: PASS
sort_a          : Elapsed execution time: 0.000694 sec
sort_a repeated : Elapsed execution time: 0.000679 sec
sort_i          : Elapsed execution time: 0.000679 sec
Generating inverted array of 10000 elements
Arrays are sorted: yes
 --> test_correctness at line 217: PASS
sort_a          : Elapsed execution time: 0.000950 sec
sort_a repeated : Elapsed execution time: 0.000989 sec
sort_i          : Elapsed execution time: 0.000942 sec

Running test #1...
 --> test_zero_element at line 245: PASS

Running test #2...
 --> test_one_element at line 266: PASS
Done testing.
[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 0.006 MB perf.data (116 samples) ]

cjl@ChenJulian:/mnt/c/Data Files/mit6.172/Homework/HW2/homework$ perf report
# To display the perf.data header info, please use --header/--header-only options.
#
#
# Total Lost Samples: 0
#
# Samples: 116  of event 'cpu-clock:uhpppH'
# Event count (approx.): 29000000
#
# Overhead  Command  Shared Object  Symbol
# ........  .......  .............  ...........................
#
    43.10%  sort     sort           [.] sort_a
    17.24%  sort     libc-2.27.so   [.] cfree@GLIBC_2.2.5
    16.38%  sort     libc-2.27.so   [.] malloc
    15.52%  sort     sort           [.] sort_i
     2.59%  sort     sort           [.] mem_alloc
     1.72%  sort     sort           [.] mem_free
     0.86%  sort     libc-2.27.so   [.] intel_check_word.isra.0
     0.86%  sort     libc-2.27.so   [.] rand_r
     0.86%  sort     sort           [.] free@plt
     0.86%  sort     sort           [.] malloc@plt


#
# (Tip: Show current config key-value pairs: perf config --list)
#

cjl@ChenJulian:/mnt/c/Data Files/mit6.172/Homework/HW2/homework$ valgrind --tool=cachegrind --branch-sim=yes ./sort 10000 10
==698== Cachegrind, a cache and branch-prediction profiler
==698== Copyright (C) 2002-2017, and GNU GPL'd, by Nicholas Nethercote et al.
==698== Using Valgrind-3.13.0 and LibVEX; rerun with -h for copyright info
==698== Command: ./sort 10000 10
==698==
--698-- warning: L3 cache found, using its data for the LL simulation.

Running test #0...
Generating random array of 10000 elements
Arrays are sorted: yes
 --> test_correctness at line 217: PASS
sort_a          : Elapsed execution time: 0.014621 sec
sort_a repeated : Elapsed execution time: 0.014605 sec
sort_i          : Elapsed execution time: 0.014851 sec
Generating inverted array of 10000 elements
Arrays are sorted: yes
 --> test_correctness at line 217: PASS
sort_a          : Elapsed execution time: 0.028227 sec
sort_a repeated : Elapsed execution time: 0.028304 sec
sort_i          : Elapsed execution time: 0.028753 sec

Running test #1...
 --> test_zero_element at line 245: PASS

Running test #2...
 --> test_one_element at line 266: PASS
Done testing.
==698==
==698== I   refs:      351,917,014
==698== I1  misses:          1,624
==698== LLi misses:          1,494
==698== I1  miss rate:        0.00%
==698== LLi miss rate:        0.00%
==698==
==698== D   refs:      130,160,086  (78,415,144 rd   + 51,744,942 wr)
==698== D1  misses:        324,704  (   184,171 rd   +    140,533 wr)
==698== LLd misses:          5,123  (     2,421 rd   +      2,702 wr)
==698== D1  miss rate:         0.2% (       0.2%     +        0.3%  )
==698== LLd miss rate:         0.0% (       0.0%     +        0.0%  )
==698==
==698== LL refs:           326,328  (   185,795 rd   +    140,533 wr)
==698== LL misses:           6,617  (     3,915 rd   +      2,702 wr)
==698== LL miss rate:          0.0% (       0.0%     +        0.0%  )
==698==
==698== Branches:       60,182,795  (57,681,766 cond +  2,501,029 ind)
==698== Mispredicts:     3,703,747  ( 3,703,346 cond +        401 ind)
==698== Mispred rate:          6.2% (       6.4%     +        0.0%   )

use inlining & DEBUG=1

$ cjl@ChenJulian:/mnt/c/Data Files/mit6.172/Homework/HW2/homework$ perf record ./sort 10000 10

Running test #0...
Generating random array of 10000 elements
Arrays are sorted: yes
 --> test_correctness at line 217: PASS
sort_a          : Elapsed execution time: 0.001043 sec
sort_a repeated : Elapsed execution time: 0.001049 sec
sort_i          : Elapsed execution time: 0.001048 sec
Generating inverted array of 10000 elements
Arrays are sorted: yes
 --> test_correctness at line 217: PASS
sort_a          : Elapsed execution time: 0.001652 sec
sort_a repeated : Elapsed execution time: 0.001609 sec
sort_i          : Elapsed execution time: 0.001595 sec

Running test #1...
 --> test_zero_element at line 245: PASS

Running test #2...
 --> test_one_element at line 266: PASS
Done testing.
[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 0.009 MB perf.data (203 samples) ]

cjl@ChenJulian:/mnt/c/Data Files/mit6.172/Homework/HW2/homework$ perf report
# To display the perf.data header info, please use --header/--header-only options.
#
#
# Total Lost Samples: 0
#
# Samples: 203  of event 'cpu-clock:uhpppH'
# Event count (approx.): 50750000
#
# Overhead  Command  Shared Object  Symbol
# ........  .......  .............  ................................
#
    34.98%  sort     sort           [.] merge_a
    15.27%  sort     sort           [.] merge_i
    12.32%  sort     sort           [.] copy_a
     9.85%  sort     libc-2.27.so   [.] cfree@GLIBC_2.2.5
     8.87%  sort     sort           [.] copy_i
     5.42%  sort     sort           [.] sort_a
     3.45%  sort     libc-2.27.so   [.] malloc
     1.97%  sort     sort           [.] copy_data
     1.97%  sort     sort           [.] mem_alloc
     1.97%  sort     sort           [.] sort_i
     0.99%  sort     sort           [.] init_data
     0.99%  sort     sort           [.] mem_free
     0.49%  sort     ld-2.27.so     [.] strcmp
     0.49%  sort     libc-2.27.so   [.] __fxstat64
     0.49%  sort     libc-2.27.so   [.] __memmove_avx_unaligned_erms
     0.49%  sort     sort           [.] malloc@plt


#
# (Tip: Customize output of perf script with: perf script -F event,ip,sym)
#

cjl@ChenJulian:/mnt/c/Data Files/mit6.172/Homework/HW2/homework$ valgrind --tool=cachegrind --branch-sim=yes ./sort 10000 10
==758== Cachegrind, a cache and branch-prediction profiler
==758== Copyright (C) 2002-2017, and GNU GPL'd, by Nicholas Nethercote et al.
==758== Using Valgrind-3.13.0 and LibVEX; rerun with -h for copyright info
==758== Command: ./sort 10000 10
==758==
--758-- warning: L3 cache found, using its data for the LL simulation.

Running test #0...
Generating random array of 10000 elements
Arrays are sorted: yes
 --> test_correctness at line 217: PASS
sort_a          : Elapsed execution time: 0.025676 sec
sort_a repeated : Elapsed execution time: 0.025433 sec
sort_i          : Elapsed execution time: 0.025869 sec
Generating inverted array of 10000 elements
Arrays are sorted: yes
 --> test_correctness at line 217: PASS
sort_a          : Elapsed execution time: 0.050344 sec
sort_a repeated : Elapsed execution time: 0.050162 sec
sort_i          : Elapsed execution time: 0.050949 sec

Running test #1...
 --> test_zero_element at line 245: PASS

Running test #2...
 --> test_one_element at line 266: PASS
Done testing.
==758==
==758== I   refs:      608,332,662
==758== I1  misses:          1,574
==758== LLi misses:          1,481
==758== I1  miss rate:        0.00%
==758== LLi miss rate:        0.00%
==758==
==758== D   refs:      388,800,501  (289,890,848 rd   + 98,909,653 wr)
==758== D1  misses:        323,971  (    183,785 rd   +    140,186 wr)
==758== LLd misses:          5,122  (      2,419 rd   +      2,703 wr)
==758== D1  miss rate:         0.1% (        0.1%     +        0.1%  )
==758== LLd miss rate:         0.0% (        0.0%     +        0.0%  )
==758==
==758== LL refs:           325,545  (    185,359 rd   +    140,186 wr)
==758== LL misses:           6,603  (      3,900 rd   +      2,703 wr)
==758== LL miss rate:          0.0% (        0.0%     +        0.0%  )
==758==
==758== Branches:       67,436,323  ( 64,935,328 cond +  2,500,995 ind)
==758== Mispredicts:     4,682,231  (  4,681,844 cond +        387 ind)
==758== Mispred rate:          6.9% (        7.2%     +        0.0%   )

我们可以看到内联函数会导致时间花费更长。

在这个写作任务中，你需要解释递归函数内联化可能导致的性能下降，并说明使用 Cachegrind 收集的分析数据如何帮助你测量这些负面性能影响。

递归函数内联化的潜在性能问题包括：

代码膨胀：内联递归函数会导致代码膨胀，因为每次递归调用都会展开为相应的代码，这可能会导致生成更多的指令。代码膨胀可能会增加指令缓存的压力，降低缓存命中率。
栈消耗：递归函数通常使用函数调用栈来保存每个递归调用的状态。内联递归函数可能会导致栈消耗过大，尤其在递归深度较大时。较大的栈消耗可能会导致栈溢出或者减慢程序的执行速度。
冗余计算：内联展开递归函数的过程中，可能会进行一些冗余计算，因为相同的计算可能在不同的展开代码中多次出现。这会增加指令执行的开销，降低程序的效率。

Pointers vs Arrays

Write-up 4

Give a reason why using pointers may improve performance. Report on any performance differences you observed in your implementation.

指针

Coarsening

Write-up 5

Explain what sorting algorithm you used and how you chose the number of elements to be sorted in the base case. Report on the performance differences you observed.

未优化

代码

static inline void merge_m(data_t *A, int p, int q, int r)
{
  assert(A);
  assert(p <= q);
  assert((q + 1) <= r);
  int n1 = q - p + 1;
  int n2 = r - q;

  data_t *left = 0, *right = 0;
  mem_alloc(&left, n1 + 1);
  mem_alloc(&right, n2 + 1);
  if (left == NULL || right == NULL)
  {
    mem_free(&left);
    mem_free(&right);
    return;
  }

  copy_m(&(A[p]), left, n1);
  copy_m(&(A[q + 1]), right, n2);
  left[n1] = UINT_MAX;
  right[n2] = UINT_MAX;

  int i = 0;
  int j = 0;

  for (int k = p; k <= r; k++)
  {
    if (left[i] <= right[j])
    {
      A[k] = left[i];
      i++;
    }
    else
    {
      A[k] = right[j];
      j++;
    }
  }
  mem_free(&left);
  mem_free(&right);
}

cjl@ChenJulian:/mnt/c/Data Files/mit6.172/Homework/HW2/homework$ valgrind --tool=cachegrind --branch-sim=yes ./sort 10000 10
==41== Cachegrind, a cache and branch-prediction profiler
==41== Copyright (C) 2002-2017, and GNU GPL'd, by Nicholas Nethercote et al.
==41== Using Valgrind-3.13.0 and LibVEX; rerun with -h for copyright info
==41== Command: ./sort 10000 10
==41==
--41-- warning: L3 cache found, using its data for the LL simulation.

Running test #0...
Generating random array of 10000 elements
Arrays are sorted: yes
 --> test_correctness at line 217: PASS
sort_m          : Elapsed execution time: 0.031639 sec
Generating inverted array of 10000 elements
Arrays are sorted: yes
 --> test_correctness at line 217: PASS
sort_m          : Elapsed execution time: 0.062058 sec

Running test #1...
 --> test_zero_element at line 245: PASS

Running test #2...
 --> test_one_element at line 266: PASS
Done testing.
==41==
==41== I   refs:      208,493,318
==41== I1  misses:          1,554
==41== LLi misses:          1,458
==41== I1  miss rate:        0.00%
==41== LLi miss rate:        0.00%
==41==
==41== D   refs:      132,595,246  (98,877,918 rd   + 33,717,328 wr)
==41== D1  misses:        131,781  (    69,494 rd   +     62,287 wr)
==41== LLd misses:          5,116  (     2,413 rd   +      2,703 wr)
==41== D1  miss rate:         0.1% (       0.1%     +        0.2%  )
==41== LLd miss rate:         0.0% (       0.0%     +        0.0%  )
==41==
==41== LL refs:           133,335  (    71,048 rd   +     62,287 wr)
==41== LL misses:           6,574  (     3,871 rd   +      2,703 wr)
==41== LL miss rate:          0.0% (       0.0%     +        0.0%  )
==41==
==41== Branches:       22,912,991  (22,012,175 cond +    900,816 ind)
==41== Mispredicts:     1,558,061  ( 1,557,736 cond +        325 ind)
==41== Mispred rate:          6.8% (       7.1%     +        0.0%   )

优化后：

报错,感觉有点花费时间。先不细究了。

mit-6-172-hw2